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Electric polarizability is an important parameter for the internal structure of hadrons. Previous 
studies of polarizabilities have been done at relatively heavy pion masses, leaving the chiral region 
largely unexplored. In this report, we use overlap fermions which are known to be computation- 
ally demanding to properly capture the chiral dynamics. We present an implementation strategy 
to construct overlap on multi-GPUs. We find that our GPU code has an equivalent of ^ 30 CPU 
cores to 1 GPU. We also present preliminary results for the polarizability of the neutral pion. 
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1. Introduction 

Electric polarizability is an important characterization of the internal structure of hadrons. 
However, current lattice QCD calculations use pion masses > 300 MeV which is a factor of 2 
greater than the physical mass. Yet, at relatively low pion masses that lie above the physical point, 
interesting physics is still predicted to occur. For example, when is smaller than — trin, the 
polarizability is expected to change substantially [1]. 

Simulations performed near the physical point are challenging. The effects of chiral symmetry 
breaking become important at lower masses. Furthermore, exceptional configurations appear more 
frequently in the simulations [2], making the inversion of the Dirac operator more problematic. To 
overcome these difficulties, we use the overlap operator [3] which preserves exact chiral symmetry 
on the lattice and does not suffer from exceptional configurations. However, the overlap operator 
is computationally expensive. Due to this we implement it on Graphic Processing Units (GPUs). 

In recent years, the usage of GPUs has increased in the lattice community [4], [5], [6], [7], 
[8]. They are known to outperform CPUs by large factors. However, one drawback is that they 
have relatively small memory compared to CPUs. The overlap operator is not only numerically 
demanding but it is also memory intensive, forcing us to implement overlap using multi-GPU 
aixhitectures. 

The outline of the paper is as follows: In section 2, we introduce the overlap formalism and the 
method we use to approximate it. Section 3 describes the implementation of overlap on GPUs and 
how we compute overlap quark propagators; here, we also address the issue of memory constraints. 
In section 4, we describe the methodology of how we compute the polarizability. In particular, the 
background field method [9] is discussed. Lastly, we present our results on the neutral pion. 

2. Overlap Formalism 

The Dirac overlap operator is defined as 



where H„ = ysD^ and D^, is the Wilson Dirac operator. It is numerically not feasible to compute 
sign(//n ) exactly. One must use numerical methods to approximate the matrix sign function. There 
are 2 commonly used approximations: polynomial [10] and rational [11] approximations. In this 
study, we use the polynomial approximation. One constructs a polynomial P{x) = {p^ + p\x + 
P2X^ -\- ..p„x") to approximate x^^^^. The matrix sign function is then approximated as sign(//n,) w 



where A = 0.41, b = 2.1. The parameter 5 is the error of the approximation i.e. 5 = 1 1 — ^/xP{x)\. 
The pai^ameter ^/e defines an interval, [—^/e,\/e], around zero where the approximation breaks 
down. Because we want the approximation to be valid over the whole spectrum of H^^„ t/£ needs 
to be less than the smallest eigenvalue of In our runs, we find A„„„ ~ 10^^. The corresponding 
polynomial order is ~ 10^ for 5 = 10^^°. However, this is impractical. 



Dov = 1 + 75 sign(//H.) 



(2.1) 




d=Ae 



(2.2) 
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The general approach to this problem is to divide the approximation into 2 regions which we 
call the small space and large space. The small space is computed exactly by calculating a small 
spectrum of Q around zero. The remaining large space is approximated by sign(//vv) w QP{Q^) 
with a significantly smaller polynomial order. This makes the calculation more feasible. Figure 1 
shows a small order approximation and the breakdown of the 2 regions. 
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Figure 1: Polynomial approximation xP{x^) for the sign function with 5 = 0.1, n — 13, and -y/e = 0.06. 
This is a schematic diagram of how the sign function approximation is divided into 2 parts: small space and 
large space. The green/vertical and red/horizontal region corresponds to the small space and large space, 
respectively. 

We compute the quark propagators using adaptive conjugate gradient (CG) methods. For 
small quark masses, the convergence is very slow. Therefore, we use deflation to help speed up the 
calculation. This consists of computing the low lying spectrum of the overlap operator. For pion 
masses on the order of 200 MeV, deflation was shown to increase the convergence rate by a factor 
of 3-4 [12]. The process of obtaining the eigenmodes for Dgy. is similar to the computation of the 
eigenmodes of H„ that is used for the small space calculation. We now describe how to implement 
these eigensystem and propagator calculations on GPUs. 

3. Overlap implementation on multi-GPUs 

GPUs have been very successful thus far in lattice QCD calculations. They have good float- 
ing point performance and large memory bandwidth. However, a drawback of GPUs is their fixed 
memory size, ranging from 1-6 GB per GPU. Because the CPUs are bottlenecked by communica- 
tion over the PCI bus, it is advantageous, though not always possible, to keep all necessary data on 
the GPU memory. 

In this study, we used a lattice size of 24^ x 64. The calculations were done on a GPU cluster 
consisting of 6GB of memory per GPU. On a single GPU, we can store 36 vectors. We will show 
that due to memory constraints, we cannot compute overlap propagators on a single GPU, forcing 
us to use multi-GPU architectures. 

To compute an overlap quark propagator, there are 2 main routines that need to be imple- 
mented: an eigensystem solver and a multi-mass inverter. The eigensystem solver is used to com- 
pute the eigensystems of H„ and Do,,- We discuss these 2 routines in the following sections. 
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3.1 Eigensystem solver 

The practical implementation of the overlap operator requires the calculation of the lowest 
lying eigenmodes of H„. In order for one to use deflation, we also need to compute the lowest 
eigenmodes of Dov To calculate these eigenmodes, we use implicitly restarted Arnoldi factoriza- 
tion [5]. In this algorithm we build a subspace of the size 2.51, where / is the number of desired 
eigenmodes. In most cases, I is too large so that the required memory cannot fit into a single GPU. 

For the H„ eigensystem the choice of / is dictated by the spectrum of its eigenvalues. It has 
been shown [5] that the eigenvalue distribution of varies little from configuration to configu- 
ration. Therefore, one can calculate a eigensystem from one configuration and use eqn. 2.2 
to compute the corresponding polynomial order to determine an optimal choice of /. We note that 
for a given 5 there is an inverse relation between n and A/. A larger A/ corresponds to a smaller n, 
which is desirable. It has been shown [12] that using smeared operators increases the magnitude 
of the eigenvalues for the low lying eigenmodes. In this study, we smear the gauge links 3 times 
using nhyp smearing [13]. In fig. 2 we plot a small spectrum of //„, as a function of the number of 
eigenvalues for a single configuration. Going from / = 100 to 200, n is reduced by about a factor 
of 2. However, going from 200 to 300 there only is a 15% reduction. Moreover, / = 300 requires 
1.5 times more memory, we therefore choose / = 200. The advantage of using smaller values of / 
is discussed below. 
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Figure 2: Polynomial order to approximate the sign 
function as a function of I. 



Figure 3: Strong scaling of the double precision 
dslash on 24^ x 64 lattice. 



With / = 200 a minimum of 16 GPUs are needed in order to hold this amount of data. One 
should always use the least amount of GPUs required to do the calculation; this is due to the 
scaling of the dslash operator. Recall that the overlap calculation requires a large amount of dslash 
multiplications. Our timings indicate that approximately 70% of the overlap calculation is spent in 
the dslash routine. The scaling of dslash directly affects the scaling of overlap. Figure 3 shows the 
dslash scaling from 1 to 32 GPUs. From the plot we see that it is beneficial to use the least amount 
of GPUs. 

Chebyshev acceleration [14] has also been implemented in this calculation. This serves 2 pur- 
poses. First, it speeds up the convergence of the H„ eigensystem. We use a Chebyshev polynomial 
of order 100. The algorithm usually converges in 1 iteration. Second, it offers an alternative when 
GPU memory alone is insufficient. By using the Chebyshev acceleration in conjunction with the 
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Amoldi eigensolver, we can utilize CPU memory. However, in the case of this may impact the 
performance of the code substantially as explained below. 

In simulations where GPU memory is not sufficient, we implemented a mixed code to utilize 
both CPU and GPU memory. CPU memory is generally much larger than GPU memory; one can 
store all the H„ eigenvectors on the CPU. If the eigensystem solver uses no Chebyshev accelera- 
tion, we would have to transport each vector to and from the CPU memory after a single dslash 
multiplication. The overhead associated with transporting the vectors over the PCI bus would be 
much greater than the time spent in the calculation which would make the code very inefficient. 
However, by using Chebyshev acceleration one needs to perform a few hundred dslash multiplica- 
tions for each vector transported. This hides the communication overhead. The drawback is that 
all vector routines, such as vector additions and scalar products, are computed on the CPU. This 
can impact the performance of the code substantially. In particular, the larger value of / will have 
a worse performance because the number of dslash operations (done on GPU) scales as / while the 
number of vector routines (done on CPU) scale like in the Amoldi algorithm. 

The GPU cluster used for our runs has 6GB of memory per GPU with QDR Infiniband. We 
use 16 GPUs in order to fit all data into GPU memory. To compare the GPU performance with 
CPU codes, we run similar calculations on a Cray XT-5 machine. Since the scaling of the CPU 
codes is poorer than our GPU codes, we run our codes on 256 cores which is the minimum required 
to complete our tests in the time limit imposed by the scheduling system. 

In our runs with 16 GPUs and / = 200 the all GPU code takes 510 seconds, while the mixed 
code takes 1600 seconds. Our all CPU code took 1100 seconds. Compared to the all GPU code 
and the machines we ran on, we find that there is an equivalent of about 34 CPU cores to 1 GPU. 
For the mixed algorithm there is an equivalent of 1 1 CPU cores to 1 GPU. 

Once the eigenvectors have been computed, we can obtain the eigensystems for Dgy which 
will be used for deflation to help speed up the propagator calculations. We again use the Amoldi 
eigensolver to compute the eigensystem in a similar manner as was done with which includes 
using Chebyshev acceleration. A mixed approach using CPU and GPU memory can also be used 
for computing the eigensystem of Dgy if GPU memory is limited. Because Dgy requires all of the 
H^^, eigenvectors for each matrix multiphcation, we keep all eigenvectors on the GPU and all 
Dgy eigenvectors on the CPU. In contrast to what was seen in the case of H^v, there is only a 40% 
slowdown going from an all GPU code to the mixed code. We used a Chebyshev order of 12 for 
these runs. The calculation of the 100 lowest eigenmodes for the overlap operator required a total of 
4.8 hours on our all GPU code and 6.7 hours for our mixed code. The main reason for the increase 
of relative performance of D^v, in comparison to H^, is due to the fact that D^v has to perform 
many more dslash operations for each Amoldi iteration. For the mixed code one should notice that 
the calculation of eigenmodes for Dgv is much more expensive than for Hw. This means that the 
factor of 3 in performance lost in the mixed code is not substantial when considering the whole 
calculation of eigensystems. Again, we can compare our GPU code to an all CPU calculation. The 
all CPU required 11.2 hours on 256 cores for the eigensystem calculation of Dov Thus, there is a 
GPU equivalent count of 37 and 27 for the all GPU and mixed code, respectively. 

3.2 Quark propagator calculation 

To compute Xj/ with a given precision we implemented an a adaptive multi-mass CG 
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method [15]. The precision of D„v can be tuned by changing the order of the polynomial n. As the 
CG process converges less precision is required at each step. This allows us to change the order 
of the polynomial accordingly. Furthermore, at some point the precision required can be achieved 
using single precision arithmetic. This produces an even better performance. For a quark mass 
con^esponding to w 200 MeV and a residue of 10^^, the adaptive CG took 0.8 hours, while the 
non-adaptive CG required 1.6 hours. 

The most expensive procedure for computing an overlap quark propagator is the calculation 
of eigensy stems for D,,,,. Altogether, a complete quark propagator calculation for this work takes 
roughly 6 hours using 16 GPUs with the all GPU code. We apply this overlap implementation to 
compute the polarizability of hadrons. We will need 5 quark propagators per configuration in order 
to calculate the polarizability. We now turn our attention to the methodology for extracting the 
polarizability. 



4. Methodology and Results 

We use the background field method to introduce a constant electric field onto the lattice. The 
basic formulation is to modify the covariant derivative in the following way 



D^=d^-iG^-iqA^, (4.1) 

where q is the electric charge. and are the gluon and photon fields, respectively. has the 
effect that it multiplies all the gauge links by an extra U{1) phase factor i.e. — e^'^'^'^U^. 

In this work, we focus on the neutral pion polarizability. One method to extract the energies 
for the neutral pion is by looking at the ratio of correlation functions. This is given by 

^ ' Gait) e-'"' ' ^ ^ 

where Ge and Go are the correlation functions with and without the electric field, respectively. 
Am is the desired energy shift of the hadron. Boundary conditions play an important role for the 
background field method [16]. In this study, we use Dirichlet boundary conditions both in the 
direction of the applied electric field and time. The boundary condition in the spatial direction 
has the effect of producing a non-zero total momentum for the system. In order to extract the 
polarizability, one needs to first subtract off the additional energy based on the fact that the pion is 
moving. 

The calculation of R{t) is relatively expensive. To obtain a better signal for G£(?) we compute 
propagators with both positive and negative values of the electric field. Furthermore, for each non- 
zero value of the electric field we need to compute u and d propagators separately. This is because 
the u and d quarks couple differently to the external field due to their different electric charges. 
Thus, we need to compute 4 quark propagators to calculate Gsit) and 1 to calculate Go{t), a 
total of 5 propagators per configuration. This requires 5 different calculations of the and Dov 
eigensystems. 

We use 40 configurations of the 2-1-1 domain wall fermion gauge configurations [17] on 24^ x 
64 lattices with a^^ = 1.73(3) GeV and a pseudoscalai^ sea mass 330 MeV. Each configuration 
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was smeared 3 times using nhyp smearing. We used 4 different source positions, equalling a total of 
160 propagators for each value of the external field. Adding a new source is relatively inexpensive 
since the H„ and Dov eigensystems do not have to be recomputed. A total of 7 masses were used, 
with the lowest mass roughly 240 MeV. 

Figure 4 shows a plot of the extracted polarizabilities for the neutral pion. We see that at 
~ 500 MeV the polarizability changes sign. This is consistent with other studies [2]. Our error 
bars are rather large due to small statistics used. We are currently generating 4 times the statistics 
to get a cleaner signal at lower masses. 

n„ polarizability 

\ 



500 1000 1500 2000 

m;r[MeV] 

Figure 4: ttq polarizability. 



5. Conclusion 

The calculation with overlap fermions is very computationally demanding. However, we have 
constructed an efficient implementation of the overlap operator on CPUs. We presented a strategy 
to utilize both CPU and GPU memory if GPU memory alone is insufficient to hold all of the 
required data. Our simulations with the all GPU code indicate that a single GPU performance is 
equivalent to roughly ~ 30 CPU cores on the machines used. We have also presented preliminary 
results of the neutral pion polarizability using overlap fermions. We are currently in the process of 
generating a factor of 4 in statistics in order to reduce our error bars. 
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